Project description¶

Watching movies in the original language is a popular and effective way to improve when learning a foreign language. It is important to choose a film that matches the student's level, so that the student understands 50-70% of the dialogue. To check this, the instructor has to watch the film and decide which level it corresponds to. This costs the instructor time, and the student's and the instructor's tastes do not always match.

Project goal: develop an ML solution that automatically determines the difficulty level of English-language movies from their subtitles. The task is framed as classifying films by difficulty level.

Project Initialization¶

load modules

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Wang\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Preprocessing¶

Text Normalization¶

  • First, read each srt file into one long string
    • exclude everything inside brackets: (), [], {}
    • keep only letters, digits and punctuation
  • Then build a dictionary with the film title as key and the film's subtitles as value.
dict_keys(['10 Cloverfield lane', '10 things I hate about you', 'Aladdin', 'All dogs go to heaven', 'An American tail', 'A knights tale', 'A star is born', 'Babe', 'Back to the future', 'Batman begins', 'Beauty and the beast', 'Before I go to sleep', 'Before sunrise', 'Before sunset', 'Braveheart', 'Bridget Jones diary', 'Cast away', 'Catch me if you can', 'Clueless', 'Deadpool', 'Die hard', 'Dredd', 'Dune', 'Eurovision song contest ', 'Fight club', 'Finding Nemo', 'Forrest Gump', 'Good Will Hunting', 'Groundhog day', 'Harry Potter and the philosophers stone', 'Her', 'Home alone', 'Hook', 'House of Gucci', 'Inside out', 'It s a wonderful life', 'Knives out', 'Kubo and the two strings', 'Liar liar', 'Lion', 'Logan', 'Love actually', 'Mamma Mia', 'Mary Poppins returns', 'Matilda', 'Meet the parents', 'Moulin Rouge', 'Mrs Doubtfire', 'My big fat Greek wedding', 'Notting Hill', 'Pirates of the Caribbean', 'Pleasantville', 'Powder', 'Pulp fiction', 'Ready or not', 'Shrek', 'Sleepless in Seattle', 'Soul', 'The blind side', 'The break-up', 'The cabin in the woods', 'The fault in our stars', 'The graduate', 'The greatest showman', 'The hangover', 'The holiday', 'The invisible man', 'The jungle book', 'The kings speech', 'The lion king', 'The lord of the rings', 'The man called Flintstone', 'The secret life of Walter Mitty', 'The Shawshank redemption', 'The social network', 'The terminal', 'The terminator', 'The theory of everything', 'The usual suspects', 'Titanic', 'Toy story', 'Twilight', 'Up', 'Venom', 'Warm bodies', 'We are the Millers'])
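
The normalization steps above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function names `srt_to_text` and `build_subtitle_dict`, the set of punctuation that is kept, and the removal of cue numbers, timing lines and markup tags such as `<i>` are all assumptions.

```python
import re

# (...), [...] and {...} spans without nested brackets of the same kind.
BRACKETED = re.compile(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}")
# SRT timing line, e.g. "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# Markup tags such as <i> or </font> (assumed to be unwanted).
TAGS = re.compile(r"<[^>]+>")
# Everything except letters, digits, whitespace and basic punctuation (assumed set).
DISALLOWED = re.compile(r"[^A-Za-z0-9\s.,!?'\"\-:;]")


def srt_to_text(srt: str) -> str:
    """Collapse the contents of one srt file into a single cleaned string."""
    kept = []
    for line in srt.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # Caveat: a dialogue line made only of digits is dropped as well.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        kept.append(line)
    text = " ".join(kept)          # join first so brackets spanning lines are caught
    text = TAGS.sub(" ", text)
    text = BRACKETED.sub(" ", text)
    text = DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def build_subtitle_dict(raw_by_title: dict) -> dict:
    """Map each film title to its cleaned subtitle string."""
    return {title: srt_to_text(raw) for title, raw in raw_by_title.items()}


sample = (
    "1\n00:00:01,000 --> 00:00:04,000\n<i>Hey!</i> [door slams]\n"
    "I'll be right with you.\n\n"
    "2\n00:00:05,000 --> 00:00:07,500\n(SIGHS) So, Cameron. {note}\n"
)
subs = build_subtitle_dict({"10 things I hate about you": sample})
print(subs["10 things I hate about you"])
```

Joining the lines before stripping brackets is a deliberate choice here, since a bracketed annotation may be split across two subtitle lines.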

From the dictionary we then create a dataframe in which each row represents one movie.

subs
10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O...
10 things I hate about you Hey! I'll be right with you. So, Cameron. Here...
Aladdin Oh, I come from a land From a faraway place Wh...
All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN...
An American tail MAMA Tanya, Fievel? Will you stop that twirlin...
... ...
Twilight I'd never given much thought to how I would di...
Up Movietown News presents Spotlight on Adventure...
Venom Life Foundation Control, this is LF1. The spec...
Warm bodies What am I doing with my life? I'm so pale. I s...
We are the Millers Oh, my God... ...it's full on double rainbow a...

86 rows × 1 columns

Text Lemmatization¶

  • Lemmatize the words in the subtitles to their root form, so that vocabulary coverage can be calculated
  • Note:
    • interjection tokens (e.g. 'ah', 'haha', 'ahh'), whitespace tokens and stop words are excluded
    • tokens that are names are also excluded where possible
subs subs_lemma
10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell...
10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en...
Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla...
All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen...
An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha...
... ... ...
Twilight I'd never given much thought to how I would di... never give much thought die die place someone ...
Up Movietown News presents Spotlight on Adventure... movietown news present spotlight adventure wit...
Venom Life Foundation Control, this is LF1. The spec... life foundation control lf specimen secure hea...
Warm bodies What am I doing with my life? I'm so pale. I s... life pale get eat well posture terrible stand ...
We are the Millers Oh, my God... ...it's full on double rainbow a... full double rainbow across sky whoa god whoo m...

86 rows × 2 columns
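
The filtering rules listed for the lemmatization step can be illustrated with a small sketch. It assumes the tokens have already been lemmatized and tagged with universal part-of-speech labels (the kind a tagger such as spaCy produces); the `Tok` record, the `TOY_STOP_WORDS` set and the function names are hypothetical and stand in for whatever the notebook actually uses.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    lemma: str  # root form produced by an upstream lemmatizer
    pos: str    # universal POS tag, e.g. "VERB", "INTJ", "PROPN", "SPACE"


# Tiny illustrative stop-word set; a real run would use a full list.
TOY_STOP_WORDS = frozenset({"i", "will", "be", "with", "you", "the", "a"})

# Tags dropped by the rules above: interjections, whitespace, names.
EXCLUDED_POS = frozenset({"INTJ", "SPACE", "PROPN"})


def keep(tok: Tok) -> bool:
    """Return True if the token should appear in the lemma string."""
    lemma = tok.lemma.lower()
    return (
        tok.pos not in EXCLUDED_POS
        and lemma.isalpha()
        and lemma not in TOY_STOP_WORDS
    )


def lemma_string(tokens: list) -> str:
    """Join the lemmas of all kept tokens into one space-separated string."""
    return " ".join(tok.lemma.lower() for tok in tokens if keep(tok))


tokens = [
    Tok("hey", "INTJ"), Tok("I", "PRON"), Tok("will", "AUX"), Tok("be", "AUX"),
    Tok("right", "ADV"), Tok("with", "ADP"), Tok("you", "PRON"),
    Tok("Cameron", "PROPN"), Tok(" ", "SPACE"), Tok("go", "VERB"),
]
print(lemma_string(tokens))  # right go
```

Name removal by POS tag is only as good as the tagger: the output above still contains names such as "cameron", which is consistent with names being excluded on a best-effort basis.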

An example: the subtitles of 10 things I hate about you after the lemmatization described above.

"right cameron go nine school year army brat enough sure find padua different old school little ass wipe brain everywhere excuse say right office anymore get deviant see novel finish thank thank lot patrick verona see make visit weekly ritual moment together hit light clever kangaroo boy say expose cafeteria joke lunch lady optimist next time keep pouch michael eckman suppose show around thank god know normally send one audio visual geek know mean michael put slide michael cameron breakdown get basic beautiful people listen unless talk first bother wait rule watch eat see left coffee kid costa rican butthead edgy make sudden movement around delusional white rasta big marley fan think black semi political mostly smoke lot weed guy wait wait let guess cowboy close come cow mcdonald mcdonald future mvys ivy league accept yuppie greed back friend guy close bogey yesterday god happen bogey lozenstein start buy izod outlet mall kick hostile takeover worry pay group even think group bianca stratford sophomore burn pine perish course know beautiful deep sure see difference like love like skecher love prada backpack love skecher prada backpack listen forget incredibly uptight father widely know fact stratford sister allow date whatever everyone think sun also rise love romantic romantic hemingway abusive alcoholic misogynist squander half life hang around picasso try nail leftover oppose bitter self righteous hag friend pipe chachi guess society male asshole make worthy time sylvia plath charlotte bronte simone de beauvoir oppressive patriarchal value dictate education good mr morgan chance could get kat take midol come class someday go get bitch slap go thing stop kat want thank point view know difficult must overcome year upper middle class suburban oppression must tough next time storm crusade well lunch meat whatever white girl complain ask buy book write black man right mon even get start two anything else go office piss mr morgan later undulate desire adrienne remove 
red crimson cape sight reginald stiff judith another word engorge look okay swollen turgid tumescent perfect hear terrorize mr morgan class express opinion terrorist action way express opinion bobby ridgezay way testicle retrieval operation go quite well case interested still maintain kick ball point kat cat people perceive somewhat tempestuous heinous bitch term use often might want work thank always thank excellent guidance let get back reginald quiver member quiver member like virgin alert favorite lookin good lady reach even one reach wanna put money money get go fun guy joey donner jerk model model model mostly regional stuff rumor big tube sock ad come really really man look always vapid say totally conceite talk think look way smile look eye man totally pure miss cameron snotty little princess wear strategically plan sun dress make guy like realize never touch guy like joey realize want friend spend rest life put spank bank move move wrong mean know spanking part rest wrong right wrong wanna take shot guest actually look french tutor serious perfect speak french hey little rambo look kat read last month cosmo run along know overwhelm whelm everjust whelm think europe lady szeet young thing like ride careful leather charming new development disgusting remove head sphincter drive minor encounter shrew girlfriend sister bianca sister mewling rampalian wretch stay cool bro see later look ball katarina make anyone cry today sadly daddy precious nowhere say sarah lawrence get honey great sarah lawrence side country thus basis appeal think decide go stay go school dub like husky decide pick leave let us hope ask bianca drive home kat change drove drive home get upset daddy boy flame imbecile think might ask think know go ask think know answer always two house rule number one dating till graduate number two dating till graduate daddy unfair right want know unfair morning deliver set twin year old girl know say crack whore make skeezy boyfriend wear condom close say 
listen father well say dope focus second girl school date sister date intend see unwashed miscreant go school come planet loser oppose planet look look solve one old rule new rule bianca date mutant never date never date like get sleep night deep slumber father whose daughter impregnate talk sarah lawrence later fine wait daddy get go find blind deaf retard take movie one date sorry look like witty repartee joey eat donner suck make quick roxanne corinne andrew jarrett incredibly horrendous public break quad think start pronunciation right hacking gag spit part alternative french food could eat together saturday night ask cute name cameron listen know let date think french class wait minute curtis cameron come new rule date sister kid let ask like sailing 'cause read place rent boat beaucoup problemo calvin case hear sister particularly hideous breed loser notice little antisocial unsolved mystery use really popular like get sick something theory abound pretty sure incapable human interaction plus bitch sure know lot guy mind go difficult woman mean know people jump airplane ski cliff like extreme dating think could find someone extreme sure hell mean know could look gather group guy could perfect padua fine interested date katarina stratford never rip maybe last two people alive sheep sheep tell pointless one go look criminal hear light state trooper fire year san quentin well least horny serious man whack sell liver black market new set speaker guy listen later get date kat know mean could pay money need backer someone money stupid peach fruit roll see many right lose actually come chat chat actually think run idea see interested well hear want bianca right go sister insane head case one go right conversation purpose think need need hire guy go someone scare easily guy hear eat live duck everything beak foot clearly solid investment walk hall say say get cool association think get involve relax relax let pretend call shot busy set thing time bianca good idea 
right dick face remember guy grip rip great duck last night know see girl kat stratford want go sure sparky look take sister kat start date see whack get rule girl touching story really problem willing make problem provide generous compensation go pay take chick much twenty buck fine let us think go movie buck get popcorn want raisinet right look buck negotiation take leave trailer park fifty buck get deal fabio great practice everybody good hustle stratford thank mr chapin girlie sweat like pig actually way get guy attention mission life obviously strike fancy see work world make sense pick friday right friday night take place never like eleven broadway even know name screw know lot think doubtful doubtful screw want hear defeatist attitude want hear upbeat screw go coach chapin run bogey ever consider new look mean seriously could definite potential bury hostility hostile annoyed try people know think forget care people think always want know happen like adore thank get pearl mom hide three year daddy find drawer last week go start wear like come back claim besides look good trust ride vintage fender follow laundromat see car come say big talker depend topic fender really whip verbal frenzy afraid afraid well people maybe afraid sure think naked transparent want need baby baby asshole day mind bitch whoop whoop insurance cover pms tell seizure sarah lawrence punish want stay close home punish mom leave think could leave fine stop make decision father right want matter know want know want till even get old use want go east coast school want trust make choice want stop try control life control know want continue later wait maim joey car look like go take bus fact completely psycho manage escape attention daddy shell expect result watch bitch violate car count date get get get price hundred buck date advance forget well forget sister well hope smooth think verona go go go know try kat stratford right plan help situation man cameron majorjone bianca stratford chick 
beer flavor nipple think speak correctly say cameron love pure purer say joey donner cash donner plow whoever want plowing patrick pat let explain something set whole thing cameron get girl cameron joey pawn two go help tame wild beast absolutely research find like guy mean strictly non prison movie type way let us start friday night bogey lozenstein party perfect opportunity perfect opportunity take kat think little payback go party let us really important one like well think like white shirt well pensive damn go thoughtful go bogey lozenbrau thing friday night might good 'cause know go bother see right hear bogey lowenstein party really really really want go know unless sister know work far know go guy lang fan find picture jared leto drawer pretty sure harbor sex tendency kind guy like like pretty guy know ever hear say die date guy smoke right smoking else ask investigate inner working sister twisted mind think nothing else work need go behind enemy line go class schedule reading list date book concert ticket concert ticket black pantie tell want sex someday could like color buy black lingerie unless want someone see see room girl room personal bike think bar look like touch anything may get hepatitis get little insight complicated girl excuse one question start drink alcohol liver nothing nothing right first thing kat hate smoker tell non smoker another problem bianca say kat like pretty guy tell pretty guy pretty gorgeous guy sure know right like thai food feminist prose angry girl music indie rock persuasion list cd room suppose buy noodle book sit around listen chick play instrument right ever club skunk favorite band playing tomorrow night see club skunk right get ticket assail ear one night pair black underwear help could hurt right verona need agua two water plan ask might well get mind kind ruin surround usual cloud smoke know quit apparently bad think know guy bikini kill raincoat bad know raincoat watch never see look sexy come bogey party never give 
see use window daddy go well must know small study group friend otherwise know orgy mr stratford party hell sauna know anything party people expect kat go go normal define normal bogey lozenstein party normal bogey lozenstein bogey party lame excuse idiot school drink beer rub hope distract pathetic emptiness meaningless meaningless consumer drive live meaningless consumer drive life forjust one night forget completely wretched sister come kat fine make appearance start party daddy want wear belly daddy night around living room minute understand full weight decision perfectly aware listen every time even think kiss boy want picture wear halter top completely unbalanced go right wait minute drinking drug kissing tattoo piercing ritual animal slaughter kind god give idea daddy right early whatever drive knock sister bianca say right wear kenneth cole dress think mix genre right fact notice direct listen really mean something tell part already think time stop self involve one minute look look like great uncle milton think lose tie maybe right nervous also excited nervous excited mixed know right calm right last party go chuck cheese want talk fun good time remember guy touch anything tell must nigel brie know think get tercel toyota dual side air bag spacious back seat kiss kiss good thank man sweet look fresh tonight pussycat wait hairline recede going away sister stay away sister stay away sister guarantee stay away fight fight guy take outside thank kat look find bianca wait address public something need tell busy enjoy adolescence scamper want one right sister look place getting trash man suppose party know say want funny one later lord dance heather bite keep tie see around anywhere relax relax fine follow love bianca cameron know chastity think art together right neat really look amazing thank know look amazing bianca let us go congregate around mr cuervo see around get sears catalog thing go tube sock gig go huge hemorrhoid cream ad next week know sound kind 
bogus get act see underwear show bathing suit one see difference right show guy party sudden suck really really thank kat let one one mine man get act like human right go see fine fine come need lie lie go sleep sleep good concussion come sit sit need talk little busy right give second whole thing talk never want want joey whole time cameron like girl worth trouble think know see first joey half man secondly let anyone ever make feel like deserve want go come patronizing leave use big word smash think kat tell may concussion care never wake sure start take girl actually like like could find one see need affection blind hatred let sit right let get joey hate choose perfect revenge mainline tequila know say nope say kat come wake look listen kat open eye eye little green know go bunch go jarrett ready home minute know home till one chance man damn shame wanna go sure chastity pass bitch fun tonight ton cameron think could give ride home start band install car stereo start band father love strike type ask father permission think know get thing people know scary picnic pain ass want somebody bianca bianca offense anything mean know everyone dig sister without know vile think maybe another time never want go sail actually say always selfish know 'cause beautiful mean treat people like matter mean really like defend people call conceite help ask learn french blow could back game kat lady szay rhythm heart dance cozgirl kat babe oze table dance right give damn everybody weekend know maybe ask kat unless kick crap dumb butt wanna hear let us open book page sonnet listen faith love thee mine eye thee thousand error note tis heart love despise despite view pleas'd dote know shakespeare dead white guy knozs overlook want write version sonnet opinion everything want iambic pentameter go fight think really good assignment messin really look forward write get class get thank mr morgan shut cool picture collar keep lick stitch kid know fan shakespeare fan involve could refrain 
heart love heart courage make love know macbeth right listen friend like friend anything drunk remember plan work care think want kiss car sorry dweeb putz sorry right talk get scoop say hate fire thousand sun direct quote thank michael comforting know could need day cool maybe two imagine go antiquated mating ritual date really want get dress drakkar noir wear dexter boner feel force listen band definition blozs right right go get dress look entirely wrong perspective make statement goody something new different cupid joey concentrate awfully hard consider gym class help want talk prom look know deal go kat go sister go since let us say take care take care flozer limo tux everything make sure get prom know sick play little game wait wait wait sick let us say excuse see feminine mystique lose copy hear poetry read charming wholesome unwelcome mean think know badass think someone still pantie twist one minute think effect whatsoever pantie effect upchuck reflex nothing right still piss sweet love renew thy force say like people hear look embarrass girl sacrifice altar dignity even score listen say like people hear good true take eye like heaven touch wanna hold much long last love arrive thank god l'm alive good true take eye love baby quite right need baby warm lonely night love baby trust say pretty baby bring pray pretty baby find stay let love baby let love look pretty nervous sir szeate like pig sir eye bloodshot sir get pot confiscate mr chapin talk second stratford idea improve girl soccer team great let us talk later window know really big game hillcr high bicep huge god one even big take steroid hear steroid severely disintegrate package think package point let us hope point kick butt every year think devise plan enable finally defeat thing teach thing misdirection teach siegfrie roy anyway important think look leave run right bang score win get look leave like see plan go go show plan someone else thank enough help sneak detention cool problem think sure 
bust climb window tell ya keep distract dazzle wit excuse act way like people expect live people expectation instead disappoint start cover right something like screze never disappoint right come none stuff true state trooper fallacy dead guy parking lot duck hearsay bobby ridgezay ball fact deserve try grope lunch line fair enough accent real live australia pygmy close mom last year know porn career lie tell something true something true hate pea something real something one else knozs szeet sexy completely hot amazingly self assure anyone ever tell tell every day actually go prom request command come go want stupid tradition come people expect go push need motive want tell need therapy know anyone ever tell anszer question patrick nothing nothing pleasure company wait wait minute page seven good daddy honey like discuss tomorrow night know prom prom kat date think fool know wanna bend rule hot rod joey hot rod sister go go end story let us review kat interested die go know happen prom daddy dance kiss come home quite crisis situation imagine kiss think happen get nezs kissing keep elbozs placenta day long two second ignore fact severely unhinge discuss need night teenage normalcy normal damn dazson river kid sleep bed whatnot daddy get nezs ya get go get jiggy boy care dope ride mama raise fool thank bill ridiculous amount love across nation worldwide believe true story seattle come listen know know hate sit home suzy high school like care care firm believer something reason someone else wish luxury know sophomore got ask go prom go feel like joey never tell go ninth month like babe hate joey happened tell joke right mom leave everyone afterwards tell want anymore ready got piss dump szore never anything everyone else since exception bogey party stunning digestive pyrotechnic possible know warn tell anyone cheerleading squad find tiny dick okay tell want let make mind help daddy hold hostage stupid enough repeat mistake guess think protect let experience anything 
experience good bianca always trust people want guess never know lady thin hair bald spot solve problem instantly paint cover lt amazing powder cling tiny hair head lt actually build leave great great look hair hair system expensive interesting order hair package go prom funny szeetie lt instantly cover bald spot leave great look hair prom dress seem hear word lot lately daddy stop turn explain remember say could date kat date find guy actually kind perfect perfect cameron ask go prom really really wanna go since kat go guess alloze base aforementioned rule previous stipulation course meet let us go know every cop town bucko good get tux last minute something know lie around get dress something know lie around listen really sorry question motive wrong forgive ready prom ma'am mr stratford joey pick bianca see william ask meet mandella tell progress full hallucination milady good sir god call favor know think sophomore prom joey pick congratulation generous princess know joey like one reason even bet go friend go nail tonight milwaukee last year jail know marilyn manson sleep spice girl think see grandpa ill spend year couch watching wheel offortune make spaghettio end story way bianca cheese dick pay take kat little punk could snake bianca nothing hath hitteth faneth joey pal compadre mess wrong guy go pay little bitch right enough cross line come get little punk bianca shoot nose spray ad tomorrow make date bleed sister okay never well give chance pay take one person truly hate know set kat like like payment bonus sleep care money care care think want thank sure want go sail fun fine look know ever thank go last night really mean lot glad ready see later hope sister go meet biker big one full sperm funny tell dance hoppin part part part bianca beat hell guy bianca matter upset rub impressed father like admit daughter capable run life mean become spectator bianca still let play inning bench year go sarah lawrence even able watch game go boy tell change mind already 
send check right assume everyone find time complete poem except mr donner excuse shaft lose glass right anyone brave enough read aloud lord go hate way talk way cut hair hate way drive car hate stare hate big dumb combat boot way read mind hate much make sick even make rhyme hate hate way always right hate lie hate make laugh even bad make cry hate around fact call mostly hate way hate even close even little bit even fender strat think could use know start band besides extra cash know asshole pay take really great girl right screze fall really every day find girl flash someone get detention god buy guitar every time screw know know always drum bass maybe even one day tambourine think offense know everyone dig sister without know vile think suck let us go messin really look forward go see perky go perky go perky second one perky perky right away perky perky perky beginning shot perk bianca let us go congregate around mr cuervo see around worry well right come want long mess wrong guy go pay little bitch right enough cross line kid drive pick tune car want coffee could get prophylactic prophylactic let go could set like god want completely damage send therapy forever want lady shall go office"

Create features¶

Create numeric features from the texts so that we can build our machine learning model.

Raw text features¶

  • mean and std of sentence length (number of non-punctuation tokens per sentence)
  • number of sentences
  • mean and std of token length (number of characters per token)
  • the number of lemmas and the number of unique lemmas
subs subs_lemma tok_cnt uniq_tok_cnt sent_len sent_len_std sent_cnt tok_len tok_len_std
10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell... 2007 759 5.84 4.93 891 3.78 1.91
10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en... 3542 1254 5.95 5.41 1488 3.75 1.91
Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla... 3578 1169 5.54 6.42 1533 3.75 1.89
All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen... 3460 909 4.54 4.49 1717 3.77 1.81
An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha... 2185 678 4.79 5.85 1100 3.75 1.81
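
A rough sketch of how the raw text features listed above could be computed is shown below. The sentence splitter (a split on `.`, `!` and `?`), the token pattern, the use of the population standard deviation and the choice to measure token length on the raw text are all assumptions; the notebook may use a proper tokenizer and a different convention. The dictionary keys reuse the column names from the table.

```python
import re
from statistics import mean, pstdev


def _mean_std(values: list) -> tuple:
    """Mean and population std rounded to 2 decimals; zeros for empty input."""
    if not values:
        return 0.0, 0.0
    return round(mean(values), 2), round(pstdev(values), 2)


def raw_text_features(text: str, lemmas: str) -> dict:
    """Compute sentence, token and lemma statistics for one movie."""
    # Naive sentence split; each sentence becomes a list of non-punctuation tokens.
    sentences = [re.findall(r"[A-Za-z0-9']+", s) for s in re.split(r"[.!?]+", text)]
    sentences = [toks for toks in sentences if toks]

    sent_len, sent_len_std = _mean_std([len(toks) for toks in sentences])
    tok_len, tok_len_std = _mean_std([len(t) for toks in sentences for t in toks])
    lemma_list = lemmas.split()

    return {
        "tok_cnt": len(lemma_list),            # number of lemmas
        "uniq_tok_cnt": len(set(lemma_list)),  # number of unique lemmas
        "sent_len": sent_len,
        "sent_len_std": sent_len_std,
        "sent_cnt": len(sentences),
        "tok_len": tok_len,
        "tok_len_std": tok_len_std,
    }


feats = raw_text_features(
    "Hey! I'll be right with you. So, Cameron.",
    "right cameron right",
)
print(feats)
```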

Lexical features¶

  • Vocabulary difficulty level
    • count the subtitle words at each CEFR difficulty level, including the words that do not fit any level (in the output below, the a1_% to other_% columns hold these counts, which sum to tok_cnt)
The Oxford 3000 dictionary by CEFR level has 4563 words
subs subs_lemma tok_cnt uniq_tok_cnt sent_len sent_len_std sent_cnt tok_len tok_len_std a1_% a2_% b1_% b2_% c1_% other_% other_word
10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell... 2007 759 5.84 4.93 891 3.78 1.91 1261 242 121 48 32 303 sync bozxphd flick ben michelle michelle mich...
10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en... 3542 1254 5.95 5.41 1488 3.75 1.91 2074 351 165 96 75 781 cameron year brat padua ass anymore deviant p...
Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla... 3578 1169 5.54 6.42 1533 3.75 1.89 1717 442 308 96 81 934 faraway caravan camel roam barbaric hop arabi...
All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen... 3460 909 4.54 4.49 1717 3.77 1.81 1831 391 142 85 58 953 captioning mgm itchy tap yow itchy idgi idgi ...
An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha... 2185 678 4.79 5.85 1100 3.75 1.81 1314 184 119 37 29 502 tanya fievel twirl twirl hanukkah hanukkah fi...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Twilight I'd never given much thought to how I would di... never give much thought die die place someone ... 3820 1108 5.50 5.93 1775 3.68 1.81 2367 483 171 79 53 667 phoenix erratic harebrained okay renee thorn ...
Up Movietown News presents Spotlight on Adventure... movietown news present spotlight adventure wit... 2545 865 4.86 4.59 1270 3.78 1.96 1386 281 133 73 33 639 movietown witness civilized world america lur...
Venom Life Foundation Control, this is LF1. The spec... life foundation control lf specimen secure hea... 3390 1047 5.01 4.40 1740 3.72 1.90 1881 400 142 145 78 744 foundation lf roger lf reentry reentry shit l...
Warm bodies What am I doing with my life? I'm so pale. I s... life pale get eat well posture terrible stand ... 1873 639 4.93 3.99 1014 3.63 1.86 1154 261 81 46 39 292 posture straighter jesus wish anymore hoodie ...
We are the Millers Oh, my God... ...it's full on double rainbow a... full double rainbow across sky whoa god whoo m... 6439 1609 4.88 4.21 3295 3.63 1.80 3607 763 295 144 116 1514 rainbow whoa whoo vivid triple rainbow streak...

86 rows × 16 columns
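
The per-level counting can be sketched as a simple dictionary lookup. The `toy_levels` mapping below is made up for illustration and does not reproduce the real Oxford word list; `cefr_profile` is a hypothetical name, and the sketch assumes one level per lemma.

```python
LEVELS = ("a1", "a2", "b1", "b2", "c1")


def cefr_profile(lemmas: list, word_levels: dict) -> tuple:
    """Count lemmas per CEFR level and collect the lemmas that match no level."""
    counts = {level: 0 for level in LEVELS}
    counts["other"] = 0
    other_words = []
    for lemma in lemmas:
        level = word_levels.get(lemma)
        if level in LEVELS:
            counts[level] += 1
        else:
            # Not in the word list (or an unexpected label): goes to "other".
            counts["other"] += 1
            other_words.append(lemma)
    return counts, other_words


# Illustrative levels only, NOT taken from the actual Oxford list.
toy_levels = {"come": "a1", "place": "a1", "land": "a2"}
lemmas = "come land faraway place caravan camel roam".split()

counts, other_words = cefr_profile(lemmas, toy_levels)
print(counts)
print(" ".join(other_words))
```

Because every lemma lands in exactly one bucket, the counts always sum to the number of lemmas, which matches the relationship between the level columns and `tok_cnt` in the table.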

  • Lexical diversity
    • ratio of the number of unique words to the total number of words (for example, lemma_div = uniq_tok_cnt / tok_cnt)
    • computed on the raw text (lex_div) and on the lemmatized text (lemma_div)
subs subs_lemma tok_cnt uniq_tok_cnt sent_len sent_len_std sent_cnt tok_len tok_len_std a1_% a2_% b1_% b2_% c1_% other_% other_word lex_div lemma_div
10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell... 2007 759 5.84 4.93 891 3.78 1.91 1261 242 121 48 32 303 sync bozxphd flick ben michelle michelle mich... 0.231 0.378
10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en... 3542 1254 5.95 5.41 1488 3.75 1.91 2074 351 165 96 75 781 cameron year brat padua ass anymore deviant p... 0.204 0.354
Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla... 3578 1169 5.54 6.42 1533 3.75 1.89 1717 442 308 96 81 934 faraway caravan camel roam barbaric hop arabi... 0.212 0.327
All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen... 3460 909 4.54 4.49 1717 3.77 1.81 1831 391 142 85 58 953 captioning mgm itchy tap yow itchy idgi idgi ... 0.157 0.263
An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha... 2185 678 4.79 5.85 1100 3.75 1.81 1314 184 119 37 29 502 tanya fievel twirl twirl hanukkah hanukkah fi... 0.214 0.310
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Twilight I'd never given much thought to how I would di... never give much thought die die place someone ... 3820 1108 5.50 5.93 1775 3.68 1.81 2367 483 171 79 53 667 phoenix erratic harebrained okay renee thorn ... 0.173 0.290
Up Movietown News presents Spotlight on Adventure... movietown news present spotlight adventure wit... 2545 865 4.86 4.59 1270 3.78 1.96 1386 281 133 73 33 639 movietown witness civilized world america lur... 0.220 0.340
Venom Life Foundation Control, this is LF1. The spec... life foundation control lf specimen secure hea... 3390 1047 5.01 4.40 1740 3.72 1.90 1881 400 142 145 78 744 foundation lf roger lf reentry reentry shit l... 0.184 0.309
Warm bodies What am I doing with my life? I'm so pale. I s... life pale get eat well posture terrible stand ... 1873 639 4.93 3.99 1014 3.63 1.86 1154 261 81 46 39 292 posture straighter jesus wish anymore hoodie ... 0.213 0.341
We are the Millers Oh, my God... ...it's full on double rainbow a... full double rainbow across sky whoa god whoo m... 6439 1609 4.88 4.21 3295 3.63 1.80 3607 763 295 144 116 1514 rainbow whoa whoo vivid triple rainbow streak... 0.151 0.250

86 rows × 18 columns

spaCy Dependency parsing tree height¶

  • The height of the dependency tree can be an indicator of text complexity.
  • The higher the tree grows, the more complex the grammatical structure becomes.
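
The notebook's own implementation is not shown, so the following is only a sketch of how a tree height could be computed. It assumes a sentence is given as a list of head indices in which the root points to itself (the convention spaCy follows, where the root token's head is the token itself), and it counts the root as level 1. The function name `tree_height` is hypothetical.

```python
def tree_height(heads):
    """Height of a dependency tree given head indices.

    heads[i] is the index of the head of token i; the root points to itself.
    A single-token sentence has height 1, an empty one has height 0.
    """
    def depth(i):
        # Walk up from token i to the root, counting levels.
        d = 1
        while heads[i] != i:
            i = heads[i]
            d += 1
        return d

    return max((depth(i) for i in range(len(heads))), default=0)


# "I saw the cat": I -> saw, saw = root, the -> cat, cat -> saw
print(tree_height([1, 1, 3, 1]))  # 3
```

With spaCy, such a list could be built per sentence as `[t.head.i - sent.start for t in sent]`; the per-movie mean, standard deviation and maximum would then be taken over all sentences.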
title subs subs_lemma tok_cnt uniq_tok_cnt sent_len sent_len_std sent_cnt tok_len tok_len_std ... b1_% b2_% c1_% other_% other_word lex_div lemma_div tree_height tree_height_std max_tree_height
0 10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell... 2007 759 5.84 4.93 891 3.78 1.91 ... 121 48 32 303 sync bozxphd flick ben michelle michelle mich... 0.231 0.378 3.03 1.56 11
1 10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en... 3542 1254 5.95 5.41 1488 3.75 1.91 ... 165 96 75 781 cameron year brat padua ass anymore deviant p... 0.204 0.354 3.01 1.55 14
2 Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla... 3578 1169 5.54 6.42 1533 3.75 1.89 ... 308 96 81 934 faraway caravan camel roam barbaric hop arabi... 0.212 0.327 2.82 1.52 10
3 All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen... 3460 909 4.54 4.49 1717 3.77 1.81 ... 142 85 58 953 captioning mgm itchy tap yow itchy idgi idgi ... 0.157 0.263 2.59 1.33 9
4 An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha... 2185 678 4.79 5.85 1100 3.75 1.81 ... 119 37 29 502 tanya fievel twirl twirl hanukkah hanukkah fi... 0.214 0.310 2.59 1.38 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81 Twilight I'd never given much thought to how I would di... never give much thought die die place someone ... 3820 1108 5.50 5.93 1775 3.68 1.81 ... 171 79 53 667 phoenix erratic harebrained okay renee thorn ... 0.173 0.290 2.82 1.36 9
82 Up Movietown News presents Spotlight on Adventure... movietown news present spotlight adventure wit... 2545 865 4.86 4.59 1270 3.78 1.96 ... 133 73 33 639 movietown witness civilized world america lur... 0.220 0.340 2.70 1.42 10
83 Venom Life Foundation Control, this is LF1. The spec... life foundation control lf specimen secure hea... 3390 1047 5.01 4.40 1740 3.72 1.90 ... 142 145 78 744 foundation lf roger lf reentry reentry shit l... 0.184 0.309 2.72 1.39 10
84 Warm bodies What am I doing with my life? I'm so pale. I s... life pale get eat well posture terrible stand ... 1873 639 4.93 3.99 1014 3.63 1.86 ... 81 46 39 292 posture straighter jesus wish anymore hoodie ... 0.213 0.341 2.67 1.32 9
85 We are the Millers Oh, my God... ...it's full on double rainbow a... full double rainbow across sky whoa god whoo m... 6439 1609 4.88 4.21 3295 3.63 1.80 ... 295 144 116 1514 rainbow whoa whoo vivid triple rainbow streak... 0.151 0.250 2.69 1.36 10

86 rows × 22 columns

spaCy Part of speech (POS) count¶

Consider the following fine-grained (Penn Treebank style) POS tags:
['UH', 'PRP', 'VBP', 'RB', 'VBG', 'DT', 'NNP', 'NN', 'CC', 'WP', 'JJ', 'IN', 'RP', 'PRP$', 'NNS', 'VBZ', 'JJS', 'VBN', 'CD', 'TO', 'VB', 'WRB', 'VBD', 'MD', 'POS', 'JJR', 'WDT', 'NNPS', 'RBR', 'FW', 'EX', 'PDT', 'RBS', 'LS', 'WP$']

We count the occurrences of each tag for every movie.
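
A sketch of the counting step, assuming the tags of one movie are already available as a list of strings (for example collected from spaCy's `token.tag_`). How the pos_tag_entrophy column is computed is not shown in the notebook; the normalized Shannon entropy below is an assumption, chosen because it yields values between 0 and 1. The function name is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(counts):
    """Shannon entropy of a count table, divided by its maximum log(k)."""
    total = sum(counts.values())
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)
    return h / math.log(k)


# Toy tag sequence for a single movie.
tags = ["PRP", "VBP", "DT", "NN", "PRP", "VBP", "PRP", "UH"]
counts = Counter(tags)
print(counts["PRP"])  # 3
print(round(normalized_entropy(counts), 3))
```

A value close to 1 means the tags are used about equally often; a value close to 0 means a few tags dominate.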

UH PRP VBP RB VBG DT NNP NN CC WP ... WDT NNPS RBR FW EX PDT RBS LS WP$ pos_tag_entrophy
0 156 797 291 462 127 405 180 515 136 85 ... 17 6 5 0 17 6 1 1 0 0.818
1 322 1386 624 759 175 638 384 909 172 125 ... 23 2 4 1 17 9 1 0 1 0.809
2 350 1138 449 662 108 693 577 913 144 101 ... 6 1 13 5 15 13 10 2 2 0.813
3 180 783 349 350 93 552 2226 718 159 106 ... 6 7 2 15 11 5 1 1 0 0.736
4 278 636 286 456 87 397 459 568 126 61 ... 13 8 4 4 26 1 2 2 0 0.816
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81 379 1640 633 918 194 605 414 891 170 122 ... 19 8 7 1 17 16 5 0 0 0.809
82 321 849 323 477 101 439 464 641 124 52 ... 7 4 3 1 11 13 3 0 0 0.811
83 486 1354 629 762 208 588 475 778 158 127 ... 24 2 4 4 18 7 3 1 0 0.812
84 195 828 399 490 111 322 173 429 86 64 ... 13 5 6 2 14 8 0 0 0 0.807
85 1052 2278 1039 1333 369 1240 866 1654 289 258 ... 23 17 10 4 16 15 0 0 1 0.808

86 rows × 36 columns

  • some POS tags appear infrequently
  • tags whose mean count across movies is below 100 are dropped
['JJS',
 'CD',
 'WRB',
 'POS',
 'JJR',
 'WDT',
 'NNPS',
 'RBR',
 'FW',
 'EX',
 'PDT',
 'RBS',
 'LS',
 'WP$']

We merge the POS tag count dataframe with the previous movie dataframe.

title subs subs_lemma tok_cnt uniq_tok_cnt sent_len sent_len_std sent_cnt tok_len tok_len_std ... RP PRP$ NNS VBZ VBN TO VB VBD MD pos_tag_entrophy
0 10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell... 2007 759 5.84 4.93 891 3.78 1.91 ... 58 83 110 173 76 105 360 248 79 0.818
1 10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en... 3542 1254 5.95 5.41 1488 3.75 1.91 ... 95 163 194 300 102 138 644 243 164 0.809
2 Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla... 3578 1169 5.54 6.42 1533 3.75 1.89 ... 86 187 194 261 128 167 718 138 194 0.813
3 All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen... 3460 909 4.54 4.49 1717 3.77 1.81 ... 48 100 169 269 57 79 487 180 79 0.736
4 An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha... 2185 678 4.79 5.85 1100 3.75 1.81 ... 49 103 142 141 41 62 448 108 109 0.816
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81 Twilight I'd never given much thought to how I would di... never give much thought die die place someone ... 3820 1108 5.50 5.93 1775 3.68 1.81 ... 101 191 201 350 86 201 769 339 228 0.809
82 Up Movietown News presents Spotlight on Adventure... movietown news present spotlight adventure wit... 2545 865 4.86 4.59 1270 3.78 1.96 ... 80 158 132 240 72 88 559 112 140 0.811
83 Venom Life Foundation Control, this is LF1. The spec... life foundation control lf specimen secure hea... 3390 1047 5.01 4.40 1740 3.72 1.90 ... 72 137 198 319 98 185 639 194 147 0.812
84 Warm bodies What am I doing with my life? I'm so pale. I s... life pale get eat well posture terrible stand ... 1873 639 4.93 3.99 1014 3.63 1.86 ... 52 77 115 164 55 102 411 133 113 0.807
85 We are the Millers Oh, my God... ...it's full on double rainbow a... full double rainbow across sky whoa god whoo m... 6439 1609 4.88 4.21 3295 3.63 1.80 ... 185 299 352 545 139 243 1202 383 222 0.808

86 rows × 44 columns

Textstat¶

textstat PyPI

  • the textstat library calculates various statistics from text
  • the resulting scores help identify the complexity, readability and grade level of a text
  • we assume these statistics can also be used to evaluate movie subtitles
title subs subs_lemma tok_cnt uniq_tok_cnt sent_len sent_len_std sent_cnt tok_len tok_len_std ... difficult_words linsear_write_formula gunning_fog text_standard fernandez_huerta szigriszt_pazos gutierrez_polini crawford gulpease_index osman
0 10 Cloverfield lane Fixed Synced by bozxphd. Enjoy The Flick BEN O... fix sync bozxphd enjoy flick ben phone michell... 2007 759 5.84 4.93 891 3.78 1.91 ... 290 3.500000 3.80 3rd and 4th grade 127.19 124.16 54.29 -0.3 86.2 91.98
1 10 things I hate about you Hey! I'll be right with you. So, Cameron. Here... right cameron go nine school year army brat en... 3542 1254 5.95 5.41 1488 3.75 1.91 ... 548 3.153846 4.04 2nd and 3rd grade 126.78 123.17 54.31 -0.2 84.4 91.92
2 Aladdin Oh, I come from a land From a faraway place Wh... come land faraway place caravan camel roam fla... 3578 1169 5.54 6.42 1533 3.75 1.89 ... 431 58.000000 3.97 5th and 6th grade 126.58 122.91 54.30 0.0 83.1 92.02
3 All dogs go to heaven CAPTIONING MADE POSSIBLE BY MGM HOME ENTERTAIN... captioning make possible mgm home entertainmen... 3460 909 4.54 4.49 1717 3.77 1.81 ... 284 2.625000 3.47 2nd and 3rd grade 126.78 125.83 55.51 -0.4 85.3 95.44
4 An American tail MAMA Tanya, Fievel? Will you stop that twirlin... tanya fievel stop twirl twirl time bed come ha... 2185 678 4.79 5.85 1100 3.75 1.81 ... 169 2.000000 3.45 1st and 2nd grade 126.99 123.47 54.85 -0.2 85.4 94.00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81 Twilight I'd never given much thought to how I would di... never give much thought die die place someone ... 3820 1108 5.50 5.93 1775 3.68 1.81 ... 438 2.714286 3.38 2nd and 3rd grade 127.90 125.08 54.90 -0.6 89.8 93.65
82 Up Movietown News presents Spotlight on Adventure... movietown news present spotlight adventure wit... 2545 865 4.86 4.59 1270 3.78 1.96 ... 304 8.285714 3.60 5th and 6th grade 127.39 124.63 54.46 -0.4 87.8 93.04
83 Venom Life Foundation Control, this is LF1. The spec... life foundation control lf specimen secure hea... 3390 1047 5.01 4.40 1740 3.72 1.90 ... 452 2.437500 3.70 2nd and 3rd grade 127.70 123.75 54.91 -0.4 88.6 94.17
84 Warm bodies What am I doing with my life? I'm so pale. I s... life pale get eat well posture terrible stand ... 1873 639 4.93 3.99 1014 3.63 1.86 ... 231 1.944444 3.50 1st and 2nd grade 127.50 126.09 55.78 -0.6 89.1 96.32
85 We are the Millers Oh, my God... ...it's full on double rainbow a... full double rainbow across sky whoa god whoo m... 6439 1609 4.88 4.21 3295 3.63 1.80 ... 608 1.789474 3.22 1st and 2nd grade 127.90 126.38 55.72 -0.7 90.8 96.27

86 rows × 60 columns

Target¶

In the target dataframe

  • 88 movies
  • 3 movies without subs
  • 85 movies with subs

In the df dataframe

  • 86 movies subs
title level subtitles kinopoisk
39 Lie to me (series) B1,B2 No NaN
47 Moulin Rouge 🎙️ A2/A2+,B1 No NaN
80 The Walking Dead (series)🧟 A2/A2+ No NaN

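The movie 'Moulin Rouge' is marked as having no subtitles in the target dataframe, but its subtitles are present in df, which accounts for the 86 movies.
We then merge the target dataframe with the previous movie dataframe.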

(86, 4)

Movie DataFrame (features and target) analysis¶

General info¶

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   title                         86 non-null     object  
 1   subs                          86 non-null     object  
 2   subs_lemma                    86 non-null     object  
 3   tok_cnt                       86 non-null     int64   
 4   uniq_tok_cnt                  86 non-null     int64   
 5   sent_len                      86 non-null     float64 
 6   sent_len_std                  86 non-null     float64 
 7   sent_cnt                      86 non-null     int64   
 8   tok_len                       86 non-null     float64 
 9   tok_len_std                   86 non-null     float64 
 10  a1_%                          86 non-null     int64   
 11  a2_%                          86 non-null     int64   
 12  b1_%                          86 non-null     int64   
 13  b2_%                          86 non-null     int64   
 14  c1_%                          86 non-null     int64   
 15  other_%                       86 non-null     int64   
 16  other_word                    86 non-null     object  
 17  lex_div                       86 non-null     float64 
 18  lemma_div                     86 non-null     float64 
 19  tree_height                   86 non-null     float64 
 20  tree_height_std               86 non-null     float64 
 21  max_tree_height               86 non-null     int32   
 22  UH                            86 non-null     int64   
 23  PRP                           86 non-null     int64   
 24  VBP                           86 non-null     int64   
 25  RB                            86 non-null     int64   
 26  VBG                           86 non-null     int64   
 27  DT                            86 non-null     int64   
 28  NNP                           86 non-null     int64   
 29  NN                            86 non-null     int64   
 30  CC                            86 non-null     int64   
 31  WP                            86 non-null     int64   
 32  JJ                            86 non-null     int64   
 33  IN                            86 non-null     int64   
 34  RP                            86 non-null     int64   
 35  PRP$                          86 non-null     int64   
 36  NNS                           86 non-null     int64   
 37  VBZ                           86 non-null     int64   
 38  VBN                           86 non-null     int64   
 39  TO                            86 non-null     int64   
 40  VB                            86 non-null     int64   
 41  VBD                           86 non-null     int64   
 42  MD                            86 non-null     int64   
 43  pos_tag_entrophy              86 non-null     float64 
 44  fleich_reading_ease           86 non-null     float64 
 45  flesch_kincaid_grade          86 non-null     float64 
 46  smog_index                    86 non-null     float64 
 47  coleman_liau_index            86 non-null     float64 
 48  automated_readability_index   86 non-null     float64 
 49  dale_chall_readability_score  86 non-null     float64 
 50  difficult_words               86 non-null     int64   
 51  linsear_write_formula         86 non-null     float64 
 52  gunning_fog                   86 non-null     float64 
 53  text_standard                 86 non-null     object  
 54  fernandez_huerta              86 non-null     float64 
 55  szigriszt_pazos               86 non-null     float64 
 56  gutierrez_polini              86 non-null     float64 
 57  crawford                      86 non-null     float64 
 58  gulpease_index                86 non-null     float64 
 59  osman                         86 non-null     float64 
 60  level                         86 non-null     category
dtypes: category(1), float64(23), int32(1), int64(31), object(5)
memory usage: 40.4+ KB
  • 55 features
  • compared to 86 samples, the number of features is large ==> could lead to overfitting
  • feature selection is required
  • EDA is required

Target¶

A2/A2+       26
A2/A2+,B1     5
B1           28
B1,B2         8
B2           19
Name: level, dtype: int64
  • imbalanced target values
  • Multiclass or multilabel? ==> multiclass
  • we expect the differences between classes to be small, especially for the intermediate classes, e.g. A2/A2+,B1 and B1,B2

target segmented statistics¶

24
level A2/A2+ A2/A2+,B1 B1 B1,B2 B2
sent_len mean 5.491923 6.232000 5.730000 6.196250 5.846316
std 0.689701 1.122751 0.701517 0.449410 0.699875
sent_len_std mean 5.198077 5.750000 5.180714 5.133750 5.194737
std 1.249189 1.657785 1.650582 0.597589 1.442407
tok_len mean 3.743462 3.826000 3.756429 3.778750 3.824737
std 0.091387 0.087920 0.087653 0.121236 0.126375
tok_len_std mean 1.870385 2.016000 1.919643 1.931250 1.972105
std 0.079974 0.061887 0.072953 0.101480 0.123629
lex_div mean 0.191231 0.165400 0.189000 0.177875 0.187579
std 0.033519 0.021373 0.029991 0.028713 0.037509
lemma_div mean 0.300462 0.265400 0.301107 0.282875 0.300158
std 0.048816 0.023330 0.043606 0.032100 0.051426
tree_height mean 2.870769 3.044000 2.953214 3.102500 3.010526
std 0.203390 0.255206 0.173804 0.197611 0.202772
tree_height_std mean 1.482308 1.606000 1.509286 1.540000 1.524737
std 0.165343 0.110136 0.165304 0.069488 0.143309
max_tree_height mean 10.923077 11.600000 12.607143 11.750000 12.105263
std 2.855494 0.894427 3.909526 2.251983 1.370107
pos_tag_entrophy mean 0.809500 0.810000 0.813107 0.809625 0.812000
std 0.016207 0.006364 0.005050 0.007596 0.005568
fleich_reading_ease mean 97.016923 95.526000 97.342857 96.607500 95.137368
std 2.418442 4.601786 1.691431 2.963481 4.102992
flesch_kincaid_grade mean 1.600000 1.920000 1.571429 1.662500 1.831579
std 0.491528 0.887130 0.409025 0.410357 0.629861
smog_index mean 5.996154 6.520000 6.092857 6.137500 6.236842
std 0.349263 0.432435 0.273426 0.512522 0.439963
coleman_liau_index mean 2.863846 3.516000 2.983571 3.032500 3.255263
std 0.607085 0.781044 0.599025 0.570507 0.756434
automated_readability_index mean 2.442308 2.980000 2.525000 2.612500 2.705263
std 0.462319 0.653452 0.455115 0.559177 0.575880
dale_chall_readability_score mean 5.358077 5.274000 5.400357 5.307500 5.480526
std 0.265812 0.174442 0.220042 0.288729 0.294759
linsear_write_formula mean 10.575046 4.446320 5.243355 3.918059 4.474143
std 18.220750 2.108565 9.484014 1.146797 5.076216
gunning_fog mean 3.661923 3.936000 3.722500 3.722500 3.794211
std 0.447464 0.395386 0.401586 0.264885 0.386355
fernandez_huerta mean 126.691923 125.498000 126.917500 126.401250 125.402105
std 1.846372 3.529599 1.347227 2.103239 2.958556
szigriszt_pazos mean 124.145385 121.686000 123.658571 123.226250 122.860526
std 1.762950 1.730832 1.637522 1.949234 2.362523
gutierrez_polini mean 54.633846 53.764000 54.427500 54.370000 53.970526
std 0.855944 0.973566 0.795681 1.012253 1.096259
crawford mean -0.350000 0.020000 -0.282143 -0.237500 -0.247368
std 0.343220 0.408656 0.370239 0.226385 0.367224
gulpease_index mean 86.673077 83.780000 86.435714 85.850000 86.547368
std 4.093073 4.931734 4.349561 1.990693 3.876621
osman mean 93.055000 90.592000 92.438571 92.348750 91.151053
std 2.442966 2.791249 2.311714 3.068478 3.147964

word count¶








Summary

  • there is a faint correlation between the unique word count and the difficulty level

CEFR word percentages¶

[Plots: a1_%, a2_%, b1_%, b2_%, c1_%, other_%]
  • the majority of words are either at the A1 level or outside our vocabulary pool
  • within each movie difficulty level, the distribution of the word counts is wide and outliers exist
  • only 20% of words are above the A1 level
  • no clear correlation with difficulty level, at most a slight one

tree height¶

[Plots: tree_height, tree_height_std, max_tree_height]

Summary

  • almost no correlation between tree height and level
  • tree height std provides little information

POS count¶

[Plots: UH, PRP, VBP, RB, VBG, DT, NNP, NN, CC, WP, JJ, IN, RP, PRP$, NNS, VBZ, VBN, TO, VB, VBD, MD, pos_tag_entrophy]
  • no clear correlation with the level

textstats¶

the Other subs words¶

Take a look at the tokens that cannot be categorized by our CEFR dictionary.

4563

Modelling¶


features and targets

Split the data into training and testing sets, stratified by the target (stratify = y).
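
To illustrate what stratification does, here is a plain-Python sketch that keeps the class proportions in both parts. It is not the notebook's code (the `stratify = y` wording suggests a library splitter was used, and its exact allocation rule may differ); the function name and the test fraction of 0.15 are assumptions, the latter picked because 13 of 86 samples is about 15%. The class counts are those of the level column.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.15, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Simple per-class rounding; an assumption, not a library's rule.
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


class_counts = {"A2/A2+": 26, "A2/A2+,B1": 5, "B1": 28, "B1,B2": 8, "B2": 19}
labels = [label for label, n in class_counts.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 73 13
print(Counter(labels[i] for i in test_idx))
```

With only one test sample each for A2/A2+,B1 and B1,B2, per-class test metrics for those two classes can only be 0 or 1, which is worth keeping in mind when reading the evaluation tables.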

Training samples: 73
Testing samples: 13

Target values of training samples
B1           24
A2/A2+       22
B2           16
B1,B2         7
A2/A2+,B1     4
Name: level, dtype: int64

Target values of testing samples
A2/A2+       4
B1           4
B2           3
A2/A2+,B1    1
B1,B2        1
Name: level, dtype: int64

define function to

  • tune the hyperparameters of the selected model, using weighted F1 as the score
  • return the tuned hyperparameters

define function to

  • split the training dataset with stratified k-fold
  • cross-validate the fine-tuned model on each fold
  • print the model's accuracy and F1 score for each fold, as well as the mean scores
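
Since weighted F1 is the tuning score, a small sketch of how it is computed may help: the per-class F1 scores are averaged with each class weighted by its number of true samples (its support). The function name and the toy labels are hypothetical; the final lines recompute the weighted average from the per-class F1 and support values printed in the KNN evaluation table later in the notebook.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (y_true must be non-empty)."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # equals 2PR / (P + R)
        total += f1 * n
    return total / len(y_true)


# Toy labels, made up for illustration.
y_true = ["A2", "A2", "B1", "B1", "B2"]
y_pred = ["A2", "B1", "B1", "B1", "B1"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.5333

# Per-class F1 and support as printed in the KNN evaluation table.
f1_scores = [0.571429, 0.0, 0.60, 0.0, 0.571429]
supports = [4, 1, 4, 1, 3]
weighted = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)
print(round(weighted, 6))  # 0.492308
```

Because the weights follow the class sizes, the two small intermediate classes contribute little to the score even when they are never predicted correctly.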

SVM¶

Tune hyperparameters via cross validation¶

best weighted F1: 0.43823979591836737 with {'clf__C': 2.5999999999999996, 'clf__gamma': 'scale', 'clf__kernel': 'sigmoid'} 

cross validation again with fine-tuned parameters¶

For training
accuracys: [0.43, 0.48, 0.4, 0.56, 0.56, 0.47, 0.5, 0.56, 0.48, 0.45], Mean accuray: 0.49 
fscores: [0.4, 0.44, 0.37, 0.54, 0.52, 0.42, 0.46, 0.51, 0.45, 0.41], Mean fscores: 0.45 

For testing
accuracys: [0.5, 0.88, 0.62, 0.57, 0.29, 0.43, 0.29, 0.29, 0.57, 0.57], Mean accuray: 0.50 
fscores: [0.44, 0.83, 0.53, 0.56, 0.26, 0.34, 0.16, 0.21, 0.53, 0.51], Mean fscores: 0.44 

Feature importance¶

  • keep the features whose importance weight exceeds 0.01
Weight Feature
0.0229 ± 0.0184 linsear_write_formula
0.0220 ± 0.0250 PRP$
0.0204 ± 0.0126 automated_readability_index
0.0203 ± 0.0276 sent_len
0.0202 ± 0.0299 RB
0.0178 ± 0.0340 NNS
0.0158 ± 0.0098 NNP
0.0155 ± 0.0114 c1_%
0.0145 ± 0.0190 tree_height_std
0.0138 ± 0.0330 VBD
0.0127 ± 0.0000 NN
0.0127 ± 0.0002 coleman_liau_index
0.0125 ± 0.0158 fernandez_huerta
0.0121 ± 0.0282 b2_%
0.0102 ± 0.0191 szigriszt_pazos
0.0101 ± 0.0188 smog_index
0.0099 ± 0.0187 DT
0.0098 ± 0.0191 other_%
0.0096 ± 0.0380 tok_len
0.0096 ± 0.0110 tree_height
… 35 more …
['sent_len',
 'b2_%',
 'c1_%',
 'tree_height_std',
 'RB',
 'NNP',
 'NN',
 'PRP$',
 'NNS',
 'VBD',
 'smog_index',
 'coleman_liau_index',
 'automated_readability_index',
 'linsear_write_formula',
 'fernandez_huerta',
 'szigriszt_pazos']
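
The "Weight ± spread" layout of the table above resembles the output of a permutation importance test, although the notebook's code is not shown. Assuming that is the method, the idea can be sketched in plain Python: shuffle one feature column at a time and record how much the metric drops. All names below are hypothetical, and the toy model and data are made up.

```python
import random


def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Return a (mean drop, std of drop) pair for each feature column."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    result = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j shuffled.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - metric(y, [predict(r) for r in X_perm]))
        mean = sum(drops) / n_repeats
        std = (sum((d - mean) ** 2 for d in drops) / n_repeats) ** 0.5
        result.append((mean, std))
    return result


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy model that only looks at the first feature.
X = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 9]]
y = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(lambda row: row[0], X, y, accuracy)
print(importances[1])  # (0.0, 0.0): the ignored feature has no importance
```

With few samples the spread of the drops can be as large as the mean itself, so a fixed cutoff on the mean selects features only roughly.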

Logistic Regression¶

Tune hyperparameters via cross validation¶

best weighted F1: 0.41553571428571434 with {'clf__C': 1, 'clf__class_weight': 'balanced', 'clf__multi_class': 'multinomial', 'clf__penalty': 'l1', 'clf__solver': 'saga'} 

cross validation again with fine-tuned parameters¶

For training
accuracys: [0.78, 0.82, 0.75, 0.82, 0.77, 0.77, 0.76, 0.74, 0.79, 0.79], Mean accuray: 0.78 
fscores: [0.78, 0.82, 0.75, 0.82, 0.77, 0.77, 0.76, 0.74, 0.79, 0.78], Mean fscores: 0.78 

For testing
accuracys: [0.38, 0.62, 0.5, 0.29, 0.43, 0.29, 0.57, 0.29, 0.29, 0.71], Mean accuray: 0.44 
fscores: [0.4, 0.68, 0.47, 0.21, 0.37, 0.26, 0.54, 0.29, 0.29, 0.66], Mean fscores: 0.42 

Feature importance¶

Weight Feature
0.1653 ± 0.0837 pos_tag_entrophy
0.1607 ± 0.0586 tok_len_std
0.1361 ± 0.0884 VBD
0.1219 ± 0.0255 c1_%
0.0912 ± 0.0458 b2_%
0.0892 ± 0.0319 VBN
0.0802 ± 0.0661 tok_len
0.0619 ± 0.0276 b1_%
0.0545 ± 0.0008 JJ
0.0468 ± 0.0359 NNP
0.0443 ± 0.0230 lemma_div
0.0433 ± 0.0538 max_tree_height
0.0423 ± 0.0249 VB
0.0296 ± 0.0336 other_%
0.0283 ± 0.0376 RB
0.0277 ± 0.0243 PRP$
0.0251 ± 0.0438 VBG
0.0200 ± 0.0451 fleich_reading_ease
0.0183 ± 0.0299 tree_height
0.0122 ± 0.0497 gulpease_index
… 35 more …

KNN¶

Tune hyperparameters via cross validation¶

best weighted F1: 0.417984693877551 with {'clf__algorithm': 'auto', 'clf__n_neighbors': 7, 'clf__p': 1, 'clf__weights': 'distance'} 

cross validation¶

For training
accuracys: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], Mean accuray: 1.00 
fscores: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], Mean fscores: 1.00 

For testing
accuracys: [0.62, 0.62, 0.5, 0.71, 0.29, 0.43, 0.71, 0.14, 0.14, 0.43], Mean accuray: 0.46 
fscores: [0.59, 0.63, 0.39, 0.71, 0.21, 0.4, 0.66, 0.07, 0.08, 0.43], Mean fscores: 0.42 

Feature importance¶

Weight Feature
0 ± 0.0000 sent_len
0 ± 0.0000 NNP
0 ± 0.0000 DT
0 ± 0.0000 VBG
0 ± 0.0000 RB
0 ± 0.0000 VBP
0 ± 0.0000 PRP
0 ± 0.0000 UH
0 ± 0.0000 max_tree_height
0 ± 0.0000 tree_height_std
0 ± 0.0000 tree_height
0 ± 0.0000 CC
0 ± 0.0000 uniq_tok_cnt
0 ± 0.0000 other_%
0 ± 0.0000 tok_len
0 ± 0.0000 sent_len_std
0 ± 0.0000 lemma_div
0 ± 0.0000 sent_cnt
0 ± 0.0000 NN
0 ± 0.0000 osman
… 35 more …

Random Forest¶

Tune hyperparameters via cross validation¶

best weighted F1: 0.426875 with {'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100, 'random_state': 12345} 
{'max_depth': 5,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 100,
 'random_state': 12345}

cross validation¶

For training
accuracys: [0.86, 0.83, 0.8, 0.85, 0.82, 0.8, 0.85, 0.89, 0.85, 0.86], Mean accuray: 0.84 
fscores: [0.84, 0.81, 0.76, 0.82, 0.79, 0.76, 0.84, 0.88, 0.82, 0.84], Mean fscores: 0.82 

For testing
accuracys: [0.38, 0.62, 0.5, 0.57, 0.29, 0.43, 0.71, 0.14, 0.29, 0.43], Mean accuray: 0.44 
fscores: [0.34, 0.6, 0.39, 0.52, 0.22, 0.34, 0.66, 0.1, 0.26, 0.43], Mean fscores: 0.39 

Feature importance¶

Weight Feature
0.0618 ± 0.0286 tree_height
0.0307 ± 0.0458 VBG
0.0235 ± 0.0187 RB
0.0233 ± 0.0293 tok_len_std
0.0221 ± 0.0319 VBD
0.0202 ± 0.0283 PRP
0.0137 ± 0.0151 b2_%
0.0129 ± 0.0000 UH
0.0129 ± 0.0001 tok_cnt
0.0125 ± 0.0000 WP
0.0123 ± 0.0312 pos_tag_entrophy
0.0087 ± 0.0123 tok_len
0.0078 ± 0.0125 a1_%
0.0077 ± 0.0126 NNS
0.0068 ± 0.0223 difficult_words
0.0067 ± 0.0247 max_tree_height
0.0057 ± 0.0126 RP
0.0054 ± 0.0121 coleman_liau_index
0.0052 ± 0.0125 VBZ
0.0052 ± 0.0126 gunning_fog
… 35 more …

SVC, KNN, RFR with feature selection¶

selected features

{'JJ',
 'NN',
 'NNP',
 'NNS',
 'PRP',
 'PRP$',
 'RB',
 'UH',
 'VB',
 'VBD',
 'VBG',
 'VBN',
 'WP',
 'automated_readability_index',
 'b1_%',
 'b2_%',
 'c1_%',
 'coleman_liau_index',
 'difficult_words',
 'fernandez_huerta',
 'fleich_reading_ease',
 'gulpease_index',
 'lemma_div',
 'linsear_write_formula',
 'max_tree_height',
 'other_%',
 'pos_tag_entrophy',
 'sent_len',
 'smog_index',
 'szigriszt_pazos',
 'tok_cnt',
 'tok_len',
 'tok_len_std',
 'tree_height',
 'tree_height_std'}

SVC evaluation¶

best weighted F1: 0.4418452380952381 with {'clf__C': 1.0, 'clf__gamma': 'scale', 'clf__kernel': 'rbf'} 
Accuracy score: 0.38

A2/A2+ A2/A2+,B1 B1 B1,B2 B2 weighted_avg
precision 0.500000 0.0 0.400000 0.0 0.333333 0.353846
recall 0.250000 0.0 0.500000 0.0 0.666667 0.384615
fscore 0.333333 0.0 0.444444 0.0 0.444444 0.341880
support 4.000000 1.0 4.000000 1.0 3.000000 NaN

KNN evaluation¶

best weighted F1: 0.4195068027210884 with {'clf__algorithm': 'auto', 'clf__n_neighbors': 10, 'clf__p': 1, 'clf__weights': 'distance'} 
Accuracy score: 0.54
A2/A2+ A2/A2+,B1 B1 B1,B2 B2 weighted_avg
precision 0.666667 0.0 0.50 0.0 0.500000 0.474359
recall 0.500000 0.0 0.75 0.0 0.666667 0.538462
fscore 0.571429 0.0 0.60 0.0 0.571429 0.492308
support 4.000000 1.0 4.00 1.0 3.000000 NaN

RFR evaluation¶

best weighted F1: 0.42688350340136055 with {'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200, 'random_state': 12345} 
{'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 4,
 'min_samples_split': 2,
 'n_estimators': 200,
 'random_state': 12345}
Accuracy score: 0.54
A2/A2+ A2/A2+,B1 B1 B1,B2 B2 weighted_avg
precision 0.666667 0.0 0.50 0.0 0.500000 0.474359
recall 0.500000 0.0 0.75 0.0 0.666667 0.538462
fscore 0.571429 0.0 0.60 0.0 0.571429 0.492308
support 4.000000 1.0 4.00 1.0 3.000000 NaN

Ordinal Regression¶

OrderedModel Results
Dep. Variable: level Log-Likelihood: -59.047
Model: OrderedModel AIC: 196.1
Method: Maximum Likelihood BIC: 285.4
Date: Fri, 29 Jul 2022
Time: 14:24:26
No. Observations: 73
Df Residuals: 34
Df Model: 39
coef std err z P>|z| [0.025 0.975]
x1 -4.4566 1.980 -2.251 0.024 -8.337 -0.576
x2 0.6498 0.539 1.205 0.228 -0.407 1.707
x3 4.0730 1.803 2.259 0.024 0.538 7.607
x4 0.5044 1.040 0.485 0.628 -1.534 2.543
x5 -1.7257 0.758 -2.278 0.023 -3.211 -0.241
x6 0.8341 0.503 1.658 0.097 -0.152 1.820
x7 -0.0227 0.871 -0.026 0.979 -1.731 1.685
x8 0.4895 0.341 1.436 0.151 -0.179 1.158
x9 1.5773 1.106 1.425 0.154 -0.591 3.746
x10 -1.1327 0.663 -1.708 0.088 -2.432 0.167
x11 -0.2056 0.354 -0.582 0.561 -0.898 0.487
x12 -0.0741 1.257 -0.059 0.953 -2.538 2.390
x13 -0.2864 0.795 -0.360 0.719 -1.845 1.272
x14 -0.9263 0.591 -1.566 0.117 -2.085 0.233
x15 -0.0761 1.178 -0.065 0.949 -2.384 2.232
x16 1.9060 0.598 3.187 0.001 0.734 3.078
x17 1.4269 0.397 3.598 0.000 0.650 2.204
x18 2.5512 0.834 3.059 0.002 0.916 4.186
x19 0.4153 0.777 0.535 0.593 -1.107 1.938
x20 -0.2403 0.622 -0.386 0.699 -1.460 0.979
x21 8.7552 9.940 0.881 0.378 -10.728 28.238
x22 0.9560 1.806 0.529 0.597 -2.585 4.497
x23 1.5971 0.880 1.815 0.070 -0.128 3.322
x24 -0.3622 0.894 -0.405 0.685 -2.114 1.390
x25 -3.2238 1.384 -2.329 0.020 -5.936 -0.511
x26 -10.5873 10.605 -0.998 0.318 -31.372 10.197
x27 2.6472 1.878 1.410 0.159 -1.033 6.328
x28 1.0480 0.569 1.842 0.065 -0.067 2.163
x29 1.3284 0.770 1.724 0.085 -0.181 2.838
x30 -1.0771 0.697 -1.546 0.122 -2.442 0.288
x31 3.6436 4.658 0.782 0.434 -5.485 12.772
x32 -0.6734 0.542 -1.242 0.214 -1.736 0.390
x33 -1.9469 1.240 -1.570 0.116 -4.377 0.483
x34 -0.0381 1.129 -0.034 0.973 -2.251 2.175
x35 0.5837 1.714 0.341 0.733 -2.775 3.942
A2/A2+/A2/A2+,B1 -1.1735 0.273 -4.294 0.000 -1.709 -0.638
A2/A2+,B1/B1 -1.1064 0.478 -2.313 0.021 -2.044 -0.169
B1/B1,B2 0.6221 0.186 3.344 0.001 0.257 0.987
B1,B2/B2 -0.2711 0.345 -0.787 0.431 -0.946 0.404
accuracy score 0.54
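
A note on reading the last four rows of the summary: statsmodels' OrderedModel reports its thresholds in an internal parameterization, where the first value is the first cutpoint and each later value is the logarithm of the increment to the next cutpoint. That is why a later row can be negative even though cutpoints must increase. The sketch below applies this transformation by hand to the four printed values; the function name `cutpoints` is hypothetical.

```python
import math


def cutpoints(threshold_params):
    """Convert (first threshold, log-increments...) into increasing cutpoints."""
    first, *rest = threshold_params
    points = [first]
    for p in rest:
        # Each later parameter is a log-increment, so exp() keeps it positive.
        points.append(points[-1] + math.exp(p))
    return points


# Threshold rows of the summary, in order.
params = [-1.1735, -1.1064, 0.6221, -0.2711]
cuts = cutpoints(params)
print([round(c, 2) for c in cuts])
```

The gap between the first two cutpoints is much narrower than the gap between the second and third, which matches the small A2/A2+,B1 class sitting between the two larger classes A2/A2+ and B1.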

Summary¶

  • we cleaned and parsed the subtitles of 86 movies into numerical features
  • from the movie dataframe analysis we have two observations
    • the features do not correlate clearly with the target
    • the distribution of each feature is very wide, even for movies with the same label
  • several classifiers were evaluated via hyperparameter tuning and cross validation
  • feature importance tests allow us to select features
  • we developed several ML solutions to predict the difficulty level
    • KNN and RFR models score the highest accuracy of 0.54 with a weighted F1 score of 0.49
  • the results are not ideal
    • highly imbalanced data
    • the features do not provide enough information, as the correlation with the labels is weak
    • not enough samples for a 5-class classification

Outlook¶

  • try to get more data (86 labeled samples are too few)
  • use a neural network
  • use semi-supervised machine learning